RNA-Seq 101 for Biologists
Interested in transcriptomics and RNA-seq but not sure where to get started? Maybe you're already running RNA-seq experiments but want to better understand the assay and the downstream analysis steps. This post will help biologists and bioinformaticians understand the foundations of designing an RNA-seq experiment and analyzing the resulting data to get the most insight from it.
Background: A work-horse technique
Since the late 2000s, RNA sequencing (RNA-seq) has powered the study of gene expression. RNA-seq provides a comprehensive, genome-wide view of the transcriptome, allowing researchers to identify and quantify transcripts (RNA molecules) in a sample at a given moment. The technique has surpassed older methods like microarrays in sensitivity, dynamic range, and the ability to discover novel transcripts.
In a nutshell - RNA-seq Method
Ribonucleic acid sequencing (RNA-Seq)
- Measures the quantity and sequences of RNA in a biological sample
- Provides insights into gene expression, splicing, and transcript variants
- Sequences are used to identify differential expression, novel transcripts, and RNA modifications
- Enables a deeper understanding of cellular processes and disease mechanisms
Applications of RNA-seq
RNA-Seq is widely used for various applications, including:
- Differential gene expression analysis: Comparing gene expression levels between different conditions or treatments
- Transcriptome profiling: Cataloging all the transcripts present in a cell or tissue
- Alternative splicing analysis: Identifying different splicing events and isoforms
- Fusion gene detection: Detecting gene fusions, which are common in cancers
- Single-cell RNA-seq: Profiling gene expression at the single-cell level
Reagents & materials for RNA-seq
- RNA extraction kit: Reagents and protocol for isolating high-quality RNA from your samples
- Library preparation kit: Reagents and protocol for converting RNA into a sequencing-compatible library
- Sequencing platform: An instrument from a vendor such as Illumina, PacBio, or Oxford Nanopore for sequencing the libraries
- Computational resources: You'll need compute resources for data analysis and storage. You can purchase and run these yourself, or use a cloud-based platform like Pluto, which handles all compute and storage so you don't have to.
Designing your RNA-seq experiment
A well-designed RNA-seq experiment includes:
- Biological replicates: Include multiple replicates (ideally more than 3 per group) to ensure statistical power and reproducibility. Plan for a minimum of 3 samples per group when running differential expression algorithms like DESeq2.
- Controls: Including untreated, vehicle, WT, or other appropriate baseline samples gives you a reference to compare your experimental conditions against.
- RNA quality: High-quality RNA with minimal degradation is crucial for accurate results. Ideally, you should measure this following extraction to ensure your samples are high quality before paying for sequencing.
- Library complexity: Ensure adequate sequencing depth to capture low-abundance transcripts.
Analyzing RNA-Seq data
Pipeline: FASTQ to counts
Note: The word "pipeline" is used in a variety of contexts in bioinformatics. In this section, we are using it to refer only to the set of steps that transform raw data into a raw counts matrix. Downstream steps for analysis will be discussed in subsequent sections.
After you've prepared your samples and sent them off for sequencing, you'll receive FASTQ files. These large, raw sequencing files need to be processed through a multi-step RNA-seq pipeline to ultimately generate gene expression counts.
Common steps in an RNA-seq pipeline (and some of the software used to perform them) include:
- Quality control (QC): FastQC and MultiQC
- Adapter trimming: Trim Galore
- Alignment to reference genome: HISAT2 or STAR
- Transcript assembly: StringTie or Cufflinks
- Quantification: featureCounts or HTSeq
Running the above tools yourself will require computational infrastructure and coding expertise. Expect the RNA-seq pipeline to run in a few hours per sample, depending on your compute resources and parallelization approach.
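To make the steps above more concrete, here is a minimal Python sketch that assembles the command lines for one paired-end sample using FastQC, Trim Galore, STAR, and featureCounts. It only builds the commands and does not run them. The function name, directory layout, and file names are hypothetical, transcript assembly is left out, and the trimmed-file naming is an assumption you should verify against what your Trim Galore version writes.

```python
from pathlib import PurePosixPath


def build_pipeline_commands(sample, r1, r2, star_index, gtf,
                            outdir="results", threads=4):
    """Build (but do not run) the commands for one paired-end sample."""
    # Assumption: Trim Galore names paired-end outputs
    # <input name>_val_1.fq.gz and <input name>_val_2.fq.gz.
    stem1 = PurePosixPath(r1).name.replace(".fastq.gz", "")
    stem2 = PurePosixPath(r2).name.replace(".fastq.gz", "")
    trimmed1 = f"{outdir}/trimmed/{stem1}_val_1.fq.gz"
    trimmed2 = f"{outdir}/trimmed/{stem2}_val_2.fq.gz"

    # STAR appends "Aligned.sortedByCoord.out.bam" to the output prefix.
    prefix = f"{outdir}/aligned/{sample}_"
    bam = f"{prefix}Aligned.sortedByCoord.out.bam"

    return [
        # 1. Quality control on the raw reads
        ["fastqc", "-o", f"{outdir}/qc", r1, r2],
        # 2. Adapter and quality trimming
        ["trim_galore", "--paired", "-o", f"{outdir}/trimmed", r1, r2],
        # 3. Alignment to the reference genome
        ["STAR", "--runThreadN", str(threads),
         "--genomeDir", star_index,
         "--readFilesIn", trimmed1, trimmed2,
         "--readFilesCommand", "zcat",
         "--outSAMtype", "BAM", "SortedByCoordinate",
         "--outFileNamePrefix", prefix],
        # 4. Gene-level quantification against the annotation
        ["featureCounts", "-T", str(threads), "-p",
         "-a", gtf, "-o", f"{outdir}/counts/{sample}.txt", bam],
    ]


commands = build_pipeline_commands(
    "s1", "raw/s1_R1.fastq.gz", "raw/s1_R2.fastq.gz",
    star_index="star_index", gtf="genes.gtf",
)
for cmd in commands:
    print(" ".join(cmd))
```

To actually run the steps, you would create the output directories first and pass each list to `subprocess.run(cmd, check=True)`. Check your featureCounts version as well: newer releases need the additional `--countReadPairs` flag to count fragments rather than individual reads.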
Want to get to insights faster? With Pluto, you can run an end-to-end RNA-seq pipeline in your browser, with no infrastructure or coding required. Learn more with a live, 15-minute demo.
The output of this initial pipeline should be a count matrix.
A count matrix contains gene symbols in the first column, and raw count values for each sample in subsequent columns. Raw counts represent the number of reads that mapped to a given gene in a given sample. These values will be affected by differences in sequencing depth between samples and have not yet been normalized, so note that they should not be used directly to compare across samples. However, the algorithms that we'll use downstream for analysis typically take this raw count matrix as input.
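As a small illustration of why raw counts are not directly comparable, the sketch below builds a toy count matrix (made-up values, hypothetical sample names) in which the second sample was simply sequenced twice as deeply as the first. Scaling to counts per million (CPM) removes that depth difference. CPM is used here only because it is easy to follow; it is not the normalization that tools like DESeq2 apply internally.

```python
# Toy raw count matrix: gene symbol -> raw counts per sample (made-up values).
samples = ["sample_1", "sample_2"]
raw_counts = {
    "GAPDH": [1000, 2000],
    "ACTB": [800, 1600],
    "MYC": [200, 400],
}


def counts_per_million(counts, n_samples):
    """Scale each sample's counts by its total read count (library size)."""
    totals = [sum(row[i] for row in counts.values()) for i in range(n_samples)]
    return {
        gene: [row[i] * 1_000_000 / totals[i] for i in range(n_samples)]
        for gene, row in counts.items()
    }


cpm = counts_per_million(raw_counts, len(samples))

for gene in raw_counts:
    print(gene, "raw:", raw_counts[gene], "CPM:", cpm[gene])
```

Every raw count in sample_2 is double the value in sample_1, yet the CPM values are identical, which is what you would expect if the only difference between the two samples is sequencing depth.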
For single-cell RNA-seq experiments, the raw count matrix is very large and sparse, so it is typically not stored as a plain gene-by-sample table, but rather in a sparse matrix format, for example inside a Seurat object.
Sample annotations
Before analyzing your data, you'll want to create a table containing sample annotations. This should include a unique ID for each sample, followed by information about each sample that's specific to your experimental design, such as treatments, timepoints, or other biological factors in your experiment. It's also good practice to record replicate numbers and to include columns indicating which FASTQ file(s) came from each sample.
Here's an example sample annotation table from a bulk RNA-seq experiment in which WT or KD samples were analyzed from 2 different donors:
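A table with this kind of design could be sketched as follows. The sample IDs, column names, and FASTQ file names are hypothetical placeholders, and the small loader simply checks that every sample ID is unique before the table is used.

```python
import csv
import io
from collections import Counter

# Hypothetical annotation table: WT and KD samples from 2 donors.
ANNOTATION_CSV = """sample_id,condition,donor,replicate,fastq_r1,fastq_r2
S1,WT,donor_1,1,S1_R1.fastq.gz,S1_R2.fastq.gz
S2,KD,donor_1,1,S2_R1.fastq.gz,S2_R2.fastq.gz
S3,WT,donor_2,1,S3_R1.fastq.gz,S3_R2.fastq.gz
S4,KD,donor_2,1,S4_R1.fastq.gz,S4_R2.fastq.gz
"""


def load_annotations(text):
    """Parse the annotation table and make sure sample IDs are unique."""
    rows = list(csv.DictReader(io.StringIO(text)))
    ids = [row["sample_id"] for row in rows]
    if len(ids) != len(set(ids)):
        raise ValueError("sample_id values must be unique")
    return rows


annotations = load_annotations(ANNOTATION_CSV)

# Count how many samples fall into each condition.
samples_per_condition = Counter(row["condition"] for row in annotations)
print(dict(samples_per_condition))
```

Keeping the donor in its own column makes it possible to account for donor-to-donor variation later, rather than folding it into the condition label.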
Now that we've recorded the different sample attributes, we're ready to move on to analysis!
Analyses and visualization
There are a wide variety of analyses and visualizations you can use to investigate your RNA-seq data, all of which address different scientific questions.
Some of the most common RNA-seq analyses include:
- Differential expression analysis: Identify gene expression changes between conditions using algorithms such as DESeq2 or limma. This analysis typically produces a volcano plot or heatmap.
- Pathway analysis: Identify pathway-level changes using algorithms such as Gene Set Enrichment Analysis (GSEA). This analysis can produce enrichment plots, bar plots of top enriched pathways, and more.
- Dimensionality Reduction (PCA, UMAP, t-SNE): Visualize how samples cluster based on gene expression using algorithms such as principal components analysis (PCA), uniform manifold approximation and projection (UMAP), or t-distributed stochastic neighbor embedding (t-SNE).
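To give a feel for what sits behind a volcano plot, the sketch below takes differential expression results (log2 fold change and adjusted p-value per gene) and turns them into plot coordinates and a simple up/down/not significant label. The gene names and values are made up, and the cutoffs (|log2 fold change| of at least 1, adjusted p-value below 0.05) are common conventions used here as assumptions, not fixed rules.

```python
import math


def classify_gene(log2_fold_change, adjusted_p, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Label a gene by the direction and significance of its change."""
    if adjusted_p < padj_cutoff and log2_fold_change >= lfc_cutoff:
        return "up"
    if adjusted_p < padj_cutoff and log2_fold_change <= -lfc_cutoff:
        return "down"
    return "not significant"


def volcano_point(log2_fold_change, adjusted_p):
    """x = log2 fold change, y = -log10(adjusted p-value); needs adjusted_p > 0."""
    return (log2_fold_change, -math.log10(adjusted_p))


# Made-up results: gene -> (log2 fold change, adjusted p-value)
results = {
    "GENE_A": (2.5, 0.001),
    "GENE_B": (-1.8, 0.01),
    "GENE_C": (0.3, 0.6),
}

labels = {gene: classify_gene(lfc, padj) for gene, (lfc, padj) in results.items()}
points = {gene: volcano_point(lfc, padj) for gene, (lfc, padj) in results.items()}

for gene in results:
    print(gene, labels[gene], points[gene])
```

Genes with large changes and small adjusted p-values end up in the upper left and upper right corners of the plot, which is why those corners are where you usually look first.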
These analyses are often run using Python or R scripts. You can also create all of these plots (and more!) in Pluto with no coding required. Talk with our team to get started running bioinformatics analyses to create the interactive plots shown in this blog, all in a biologist-friendly experience.
Ready to analyze your RNA-seq data?
Thanks for reading this brief overview of RNA-seq experiments! To learn more about how your team can collaboratively analyze transcriptomics data, as well as epigenetics and other -omics data in your browser with Pluto, chat with our team to get started today.
References & additional resources
- Wang, Z., Gerstein, M., & Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics, 10(1), 57-63 (2009).
- Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M. J., ... & Pachter, L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology, 28(5), 511-515 (2010).
- Love, M. I., Huber, W., & Anders, S. Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology, 15(12), 550 (2014).